Multivariate Data Embeddings
  • Generate Multivariate Time Series Data: Generate time series data for multiple variables, such as temperature, humidity, wind speed, and atmospheric pressure.
  • Embed Data Using a Pre-trained Model: Use an open-source library such as sentence-transformers to embed the data. Since time series data typically isn't natural language text, it must be flattened or otherwise represented in a form suitable for embedding.
    • The SentenceTransformer model encodes each row of data into an embedding. Since the model is designed for text, each row is represented as a string that combines the date and the value of each variable.
  • Store the Embeddings in a FAISS Vector Database: Store the generated embeddings in a FAISS index to allow efficient similarity search.
  • Implement RAG for Querying: Query the vector database based on user input, such as a specific variable (e.g., temperature, wind speed).
    • Query and Retrieval: When a user provides a query (e.g., "temperature: 25"), we encode the query and perform a nearest-neighbor search in the FAISS index to retrieve the 3 most similar rows by embedding distance.
  • Visualize Results: Display the original and predicted data, including the embedding distance, as markdown tables, and use a radar chart for visualization.
    • Original Data: The original multivariate time series data is displayed as a markdown table.
    • Predicted Data: The predicted data (top 3 most similar rows) is displayed as a markdown table with distances.
    • Radar Chart: A radar chart is shown for the top 3 predicted rows, visualizing the relative values of the features.
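The query-and-retrieval step described above is an exact nearest-neighbor search. The sketch below reproduces it with plain NumPy on random stand-in vectors, so it runs without FAISS or a model download. The helper name `l2_search` and the stand-in vectors are illustrative assumptions, not part of the notebook. It reports squared Euclidean distances, which is the quantity `faiss.IndexFlatL2` returns.

```python
import numpy as np

def l2_search(vectors, query, k=3):
    """Brute-force k-nearest-neighbor search (hypothetical helper).

    Returns the indices of the k closest rows and their squared L2
    distances, sorted from closest to farthest.
    """
    vectors = np.asarray(vectors, dtype="float32")
    query = np.asarray(query, dtype="float32")
    diffs = vectors - query                 # broadcast the query over all rows
    dists = np.sum(diffs * diffs, axis=1)   # squared Euclidean distance per row
    order = np.argsort(dists)[:k]           # closest rows first
    return order, dists[order]

# Stand-in for the 7 x 384 embedding matrix (random values, assumption)
rng = np.random.default_rng(0)
stand_in_embeddings = rng.normal(size=(7, 384)).astype("float32")

# Query: a slightly perturbed copy of row 4, so row 4 should rank first
noise = 0.01 * rng.normal(size=384).astype("float32")
query = stand_in_embeddings[4] + noise

indices, distances = l2_search(stand_in_embeddings, query, k=3)
print(indices, distances)
```

Because the distances are squared and the embeddings are not normalized, the absolute values are mainly useful for ranking rows against each other, not as a similarity score on a fixed scale.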
In [ ]:
%pip install -q faiss-cpu sentence-transformers pandas numpy matplotlib
Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
petastorm 0.12.1 requires pyspark>=2.1.0, which is not installed.
databricks-feature-store 0.14.3 requires pyspark<4,>=3.1.2, which is not installed.
ydata-profiling 4.2.0 requires numpy<1.24,>=1.16.0, but you have numpy 2.1.3 which is incompatible.
ydata-profiling 4.2.0 requires scipy<1.11,>=1.4.1, but you have scipy 1.14.1 which is incompatible.
numba 0.55.1 requires numpy<1.22,>=1.18, but you have numpy 2.1.3 which is incompatible.
mleap 0.20.0 requires scikit-learn<0.23.0,>=0.22.0, but you have scikit-learn 1.1.1 which is incompatible.
langchain 0.0.217 requires numpy<2,>=1, but you have numpy 2.1.3 which is incompatible.
databricks-feature-store 0.14.3 requires numpy<2,>=1.19.2, but you have numpy 2.1.3 which is incompatible.
In [ ]:
import numpy as np
import pandas as pd
import faiss
from sentence_transformers import SentenceTransformer
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

# Step 1: Generate the multivariate time series data
np.random.seed(0)  # For reproducibility
dates = pd.date_range(start='2024-01-01', periods=7, freq='D')
temperature = np.random.uniform(15, 30, size=7)  # Temperature in °C
humidity = np.random.uniform(30, 90, size=7)  # Humidity in %
wind_speed = np.random.uniform(0, 15, size=7)  # Wind speed in km/h
pressure = np.random.uniform(980, 1050, size=7)  # Atmospheric pressure in hPa

# Combine into a DataFrame
data = {
    'date': dates,
    'temperature': temperature,
    'humidity': humidity,
    'wind_speed': wind_speed,
    'pressure': pressure
}

df = pd.DataFrame(data)

# Step 2: Embedding the multivariate data using SentenceTransformer
def embed_data(df):
    model = SentenceTransformer('paraphrase-MiniLM-L6-v2')  # Using a pre-trained model for general embeddings
    
    # Create a string representation of each row to feed into the model
    texts = df.apply(lambda row: f"date: {row['date']} temperature: {row['temperature']} humidity: {row['humidity']} wind_speed: {row['wind_speed']} pressure: {row['pressure']}", axis=1).tolist()

    # Create embeddings
    embeddings = model.encode(texts)
    
    return embeddings

# Embedding the data
embeddings = embed_data(df)

# Step 3: Store embeddings in FAISS vector database
def store_embeddings(embeddings):
    embeddings = np.array(embeddings).astype('float32')
    index = faiss.IndexFlatL2(embeddings.shape[1])  # L2 distance metric
    index.add(embeddings)
    return index

# Store the embeddings in the FAISS index
index = store_embeddings(embeddings)

# Step 4: Retrieval-Augmented Generation (RAG) System to query embeddings
def retrieve_similar_data(query, index, k=3):
    # The query must be encoded with the same model that produced the indexed embeddings
    model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
    query_embedding = model.encode([query]).astype('float32')
    D, I = index.search(query_embedding, k)  # D is distances (ascending), I is indices of nearest neighbors
    return I[0], D[0]

# Step 5: Display the original and predicted data

# Create a DataFrame for embeddings
original_embedding_df = pd.DataFrame(embeddings, columns=[f"dim_{i+1}" for i in range(embeddings.shape[1])])
original_embedding_len_df = len(original_embedding_df.columns)
original_embedding_row_count_df = original_embedding_df.shape[0]
# Select the first 5 columns
original_embedding_5_df = original_embedding_df.iloc[:,:5].head(5)

# Display the original multivariate data as markdown table
original_df = df.head(7)
print("\nMultivariate Time Series Weather Data:\n", original_df.to_markdown(tablefmt="github", index=False))

# Display the original multivariate data embeddings as markdown table
print("\nMultivariate Data Embeddings: (first 5 dimensions)")
print(f"For 7 rows and 5 columns of  Multivariate Data {original_embedding_len_df} vectors dimensions were created\n")
print(original_embedding_5_df.to_markdown(index=False))  # Display the embeddings in markdown table format

# Query input: Let's query by a specific variable
user_query = "temperature: 25"  # "humidity: 77" or "temperature: 25"
print(f"\nUser Query: {user_query}")

# Retrieve top 3 best matches based on the query
indices, distances = retrieve_similar_data(user_query, index)

# Get the top 3 predicted data
predicted_df = df.iloc[indices].reset_index(drop=True)
predicted_df['embedding_distance'] = distances
# Sort by distance in descending order, so the closest match appears in the last row
predicted_df = predicted_df.sort_values(by='embedding_distance', ascending=False).reset_index(drop=True)

print("\nPredicted Multivariate Data:\n", predicted_df.to_markdown(tablefmt="github", index=False))

# Step 6: Visualize predicted data as a multi-group spider (radar) chart
def plot_multi_group_spider_chart(data, title):
    categories = ['temperature', 'humidity', 'wind_speed', 'pressure']
    
    # Normalize the data to [0, 1] range for radar chart
    scaler = MinMaxScaler()
    values = scaler.fit_transform(data[categories].values)
    
    # Setup the angles for the radar chart
    num_vars = len(categories)
    angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()
    angles += angles[:1]  # Close the loop
    
    fig, ax = plt.subplots(figsize=(6, 6), dpi=80, subplot_kw=dict(polar=True)) # figsize=(8, 8)
    
    # Plot each group (predicted data rows)
    for i, row in data.iterrows():
        row_values = np.concatenate([values[i], values[i][:1]])  # Close the loop for the group
        ax.fill(angles, row_values, alpha=0.25)
        ax.plot(angles, row_values, label=f'Prediction {i+1}', linewidth=2)
    
    # Set the ticks and labels for the axes
    ax.set_yticklabels([])  # Hide radial labels
    ax.set_xticks(angles[:-1])  # Set the x-ticks to be the categories
    ax.set_xticklabels(categories, fontsize=12)  # Set the labels of each axis
    
    # Add a legend
    ax.legend(loc='upper right', bbox_to_anchor=(1.2, 1.1), fontsize=12)

    plt.title(title, size=14)
    plt.show()

# Display multi-group spider chart for the top 3 predicted data rows
plot_multi_group_spider_chart(predicted_df, "Predicted Multivariate Data")
Multivariate Time Series Weather Data:
 | date                |   temperature |   humidity |   wind_speed |   pressure |
|---------------------|---------------|------------|--------------|------------|
| 2024-01-01 00:00:00 |       23.2322 |    83.5064 |     1.06554  |   1035.94  |
| 2024-01-02 00:00:00 |       25.7278 |    87.8198 |     1.30694  |   1012.3   |
| 2024-01-03 00:00:00 |       24.0415 |    53.0065 |     0.303276 |   1034.64  |
| 2024-01-04 00:00:00 |       23.1732 |    77.5035 |    12.4893   |    988.279 |
| 2024-01-05 00:00:00 |       21.3548 |    61.7337 |    11.6724   |   1024.79  |
| 2024-01-06 00:00:00 |       24.6884 |    64.0827 |    13.0502   |    990.035 |
| 2024-01-07 00:00:00 |       21.5638 |    85.5358 |    14.6793   |   1046.13  |

Multivariate Data Embeddings: (first 5 dimensions)
For 7 rows and 5 columns of  Multivariate Data 384 vectors dimensions were created

|     dim_1 |    dim_2 |    dim_3 |    dim_4 |    dim_5 |
|----------:|---------:|---------:|---------:|---------:|
| -0.368792 | 0.408502 | 0.588999 | 0.682211 | 0.617992 |
| -0.388623 | 0.412094 | 0.581821 | 0.682733 | 0.61162  |
| -0.436087 | 0.447782 | 0.602891 | 0.715155 | 0.631992 |
| -0.432618 | 0.448834 | 0.549173 | 0.673219 | 0.551741 |
| -0.392174 | 0.364377 | 0.567262 | 0.674651 | 0.493109 |

User Query: temperature: 25

Predicted Multivariate Data:
 | date                |   temperature |   humidity |   wind_speed |   pressure |   embedding_distance |
|---------------------|---------------|------------|--------------|------------|----------------------|
| 2024-01-02 00:00:00 |       25.7278 |    87.8198 |     1.30694  |    1012.3  |              49.7495 |
| 2024-01-07 00:00:00 |       21.5638 |    85.5358 |    14.6793   |    1046.13 |              49.5342 |
| 2024-01-03 00:00:00 |       24.0415 |    53.0065 |     0.303276 |    1034.64 |              48.7736 |
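[Figure: radar chart titled "Predicted Multivariate Data" with axes temperature, humidity, wind_speed, and pressure, one polygon per retrieved row (Prediction 1 to 3)]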
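The radar chart scales each feature with `MinMaxScaler`, fitted on the three retrieved rows only, so every axis runs from the smallest to the largest retrieved value, not over the full dataset. A minimal NumPy sketch of the same arithmetic, using the values from the predicted-data table above (the helper name `min_max_scale` is illustrative, not part of the notebook):

```python
import numpy as np

def min_max_scale(values):
    """Scale each column to [0, 1] via (x - min) / (max - min)."""
    values = np.asarray(values, dtype=float)
    lo = values.min(axis=0)
    span = values.max(axis=0) - lo
    span[span == 0] = 1.0  # constant columns map to 0 instead of dividing by zero
    return (values - lo) / span

# Columns: temperature, humidity, wind_speed, pressure
retrieved = np.array([
    [25.7278, 87.8198, 1.30694, 1012.3],
    [21.5638, 85.5358, 14.6793, 1046.13],
    [24.0415, 53.0065, 0.303276, 1034.64],
])

scaled = min_max_scale(retrieved)
print(np.round(scaled, 3))
```

With only three rows, each axis always has one row at 0 and one at 1, so the polygons exaggerate small differences. Humidity values of 85.5 and 87.8, for example, land near opposite ends of the upper range even though they are close in absolute terms.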